fairness evaluation
Fair Play for Individuals, Foul Play for Groups? Auditing Anonymization's Impact on ML Fairness
Arcolezi, Héber H., Alishahi, Mina, Bendoukha, Adda-Akram, Kaaniche, Nesrine
Machine learning (ML) algorithms rely heavily on the availability of training data, which, depending on the domain, often includes sensitive information about data providers. This raises critical privacy concerns. Anonymization techniques have emerged as a practical solution to address these issues by generalizing features or suppressing data to make it more difficult to accurately identify individuals. Although recent studies have shown that privacy-enhancing technologies can influence ML predictions across different subgroups, thus affecting fair decision-making, the specific effects of anonymization techniques, such as $k$-anonymity, $\ell$-diversity, and $t$-closeness, on ML fairness remain largely unexplored. In this work, we systematically audit the impact of anonymization techniques on ML fairness, evaluating both individual and group fairness. Our quantitative study reveals that anonymization can degrade group fairness metrics by up to fourfold. Conversely, similarity-based individual fairness metrics tend to improve under stronger anonymization, largely as a result of increased input homogeneity. By analyzing varying levels of anonymization across diverse privacy settings and data distributions, this study provides critical insights into the trade-offs between privacy, fairness, and utility, offering actionable guidelines for responsible AI development. Our code is publicly available at: https://github.com/hharcolezi/anonymity-impact-fairness.
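A minimal sketch of the kind of audit described here: train a classifier before and after coarsening a quasi-identifier (a crude stand-in for $k$-anonymity-style generalization) and compare a group fairness metric such as statistical parity difference. The synthetic data, column names, and generalization rule are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch: how generalizing a quasi-identifier can shift a group
# fairness metric. Data, columns, and the binning rule are assumptions.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "hours": rng.integers(10, 80, n),
    "sex": rng.integers(0, 2, n),  # protected attribute
})
# Outcome loosely correlated with the features.
df["y"] = ((df.age / 90 + df.hours / 80 + 0.2 * df.sex + rng.normal(0, 0.3, n)) > 1.1).astype(int)

def statistical_parity_difference(y_pred, group):
    """P(yhat=1 | group=1) - P(yhat=1 | group=0)."""
    return y_pred[group == 1].mean() - y_pred[group == 0].mean()

def train_and_audit(frame):
    X, y, g = frame[["age", "hours", "sex"]], frame["y"], frame["sex"]
    X_tr, X_te, y_tr, y_te, g_tr, g_te = train_test_split(X, y, g, random_state=0)
    pred = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict(X_te)
    return statistical_parity_difference(pred, g_te.to_numpy())

baseline = train_and_audit(df)

# Coarsen 'age' into wide bins, mimicking the generalization used by anonymization.
anon = df.copy()
anon["age"] = (anon["age"] // 20) * 20
print(f"SPD original: {baseline:+.3f} | SPD generalized: {train_and_audit(anon):+.3f}")
```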
- North America > United States > California (0.04)
- Europe > France > Auvergne-Rhône-Alpes > Isère > Grenoble (0.04)
- North America > United States > Georgia > Fulton County > Atlanta (0.04)
- (3 more...)
- Law (1.00)
- Information Technology > Security & Privacy (1.00)
- Government (1.00)
- Information Technology > Data Science > Data Mining (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.94)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (0.68)
The Quest for Reliable Metrics of Responsible AI
Rampisela, Theresia Veronika, Maistro, Maria, Ruotsalo, Tuukka, Lioma, Christina
The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.
- Europe > Denmark > Capital Region > Copenhagen (0.06)
- North America > United States > New York > New York County > New York City (0.05)
Which Demographic Features Are Relevant for Individual Fairness Evaluation of U.S. Recidivism Risk Assessment Tools?
Nguyen, Tin Trung, Xu, Jiannan, Nguyen-Le, Phuong-Anh, Lazar, Jonathan, Braman, Donald, Daumé, Hal III, Jelveh, Zubin
Despite its constitutional relevance, the technical ``individual fairness'' criterion has not been operationalized in U.S. state or federal statutes/regulations. We conduct a human subjects experiment to address this gap, evaluating which demographic features are relevant for individual fairness evaluation of recidivism risk assessment (RRA) tools. Our analyses conclude that the individual similarity function should consider age and sex, but it should ignore race.
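A hedged sketch of what operationalizing this finding could look like: an individual similarity (distance) function that uses age and sex but deliberately ignores race, plugged into a Lipschitz-style individual fairness check. The weights, threshold, and toy risk score are assumptions, not the study's instrument.

```python
# Illustrative individual-fairness check where the similarity function considers
# age and sex but excludes race, as the study's results suggest. Weights and
# threshold are assumptions.
import numpy as np

def distance(a, b, age_scale=10.0):
    """Smaller = more similar. Race is deliberately excluded from the metric."""
    return abs(a["age"] - b["age"]) / age_scale + (a["sex"] != b["sex"])

def individually_fair(risk_score, x1, x2, lipschitz=1.0):
    """Similar individuals should receive similar risk scores (Lipschitz-style check)."""
    return abs(risk_score(x1) - risk_score(x2)) <= lipschitz * distance(x1, x2)

# Toy risk score and two defendants differing only in race.
risk = lambda x: 0.02 * x["age"] + 0.1 * x["sex"]
d1 = {"age": 30, "sex": 0, "race": "A"}
d2 = {"age": 30, "sex": 0, "race": "B"}
print(distance(d1, d2))                 # 0.0 -> treated as identical for fairness purposes
print(individually_fair(risk, d1, d2))  # True: identical scores are required
```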
- North America > United States > Maryland > Prince George's County > College Park (0.16)
- North America > United States > Illinois > Cook County > Chicago (0.05)
- North America > United States > California (0.05)
- (6 more...)
- Law > Statutes (1.00)
- Law > Government & the Courts (1.00)
- Government > Regional Government > North America Government > United States Government (0.95)
- Law > Civil Rights & Constitutional Law (0.94)
TrustSkin: A Fairness Pipeline for Trustworthy Facial Affect Analysis Across Skin Tone
Cabanas, Ana M., Pedro, Alma, Mery, Domingo
Understanding how facial affect analysis (FAA) systems perform across different demographic groups requires reliable measurement of sensitive attributes such as ancestry, often approximated by skin tone, which itself is highly influenced by lighting conditions. Using AffectNet and a MobileNet-based model, we assess fairness across skin tone groups defined by each method. Results reveal a severe underrepresentation of dark skin tones (2%), alongside fairness disparities in F1-score (up to 0.08) and TPR (up to 0.11) across groups. Grad-CAM analysis further highlights differences in model attention patterns by skin tone, suggesting variation in feature encoding. To support future mitigation efforts, we also propose a modular fairness-aware pipeline that integrates perceptual skin tone estimation, model interpretability, and fairness evaluation. These findings emphasize the relevance of skin tone measurement choices in fairness assessment and suggest that ITA-based evaluations may overlook disparities affecting darker-skinned individuals.
I. INTRODUCTION: Predictive algorithms and biometric systems are increasingly used in critical areas such as healthcare, security, and human-computer interaction [1]. However, these systems remain prone to bias arising from demographic imbalances in training data and algorithmic design flaws [1]-[3]. In computer vision applications like EmotionAI and Facial Affect Analysis (FAA), such biases often result in consistent performance disparities across attributes like age, sex, and skin tone [4]-[6]. Given the sensitive deployment of FAA in psychological evaluation, driver monitoring, and educational feedback [1], [7], [8], ensuring fairness, transparency, and robustness across demographic groups is essential.
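A minimal sketch of the group-wise audit this entry describes: per-skin-tone F1 and TPR, plus the largest gap between groups. Labels, predictions, and group assignments are synthetic placeholders, not outputs of the paper's AffectNet/MobileNet pipeline.

```python
# Per-group F1/TPR disparity audit with synthetic placeholder data.
import numpy as np
from sklearn.metrics import f1_score, recall_score

rng = np.random.default_rng(1)
y_true = rng.integers(0, 2, 1000)
y_pred = np.where(rng.random(1000) < 0.85, y_true, 1 - y_true)  # ~85% accurate
skin_tone = rng.choice(["light", "medium", "dark"], 1000, p=[0.6, 0.38, 0.02])

per_group = {}
for g in np.unique(skin_tone):
    mask = skin_tone == g
    per_group[g] = {
        "F1": f1_score(y_true[mask], y_pred[mask]),
        "TPR": recall_score(y_true[mask], y_pred[mask]),  # true positive rate
        "n": int(mask.sum()),
    }

for g, m in per_group.items():
    print(g, m)
f1_gap = max(m["F1"] for m in per_group.values()) - min(m["F1"] for m in per_group.values())
print("F1 gap across groups:", round(f1_gap, 3))
```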
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- South America > Chile > Arica y Parinacota Region > Arica Province > Arica (0.04)
- Europe > Switzerland (0.04)
seeBias: A Comprehensive Tool for Assessing and Visualizing AI Fairness
Ning, Yilin, Ma, Yian, Liu, Mingxuan, Li, Xin, Liu, Nan
Fairness in artificial intelligence (AI) prediction models is increasingly emphasized to support responsible adoption in high-stakes domains such as health care and criminal justice. Guidelines and implementation frameworks highlight the importance of both predictive accuracy and equitable outcomes. However, current fairness toolkits often evaluate classification performance disparities in isolation, with limited attention to other critical aspects such as calibration. To address these gaps, we present seeBias, an R package for comprehensive evaluation of model fairness and predictive performance. seeBias offers an integrated evaluation across classification, calibration, and other performance domains, providing a more complete view of model behavior. It includes customizable visualizations to support transparent reporting and responsible AI implementation. Using public datasets from criminal justice and healthcare, we demonstrate how seeBias supports fairness evaluations, and uncovers disparities that conventional fairness metrics may overlook. The R package is available on GitHub, and a Python version is under development.
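The abstract stresses calibration as an aspect most fairness toolkits skip. Below is a Python sketch of a group-wise calibration check in that spirit; it is not the seeBias R API, and the scores, labels, and groups are synthetic illustrations.

```python
# Sketch of a group-wise calibration check (concept only, NOT the seeBias R API).
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(2)
n = 4000
group = rng.integers(0, 2, n)                              # e.g., a sensitive attribute
prob = np.clip(rng.beta(2, 2, n) + 0.10 * group, 0, 1)     # scores shifted for group 1
y = rng.binomial(1, np.clip(prob - 0.05 * group, 0, 1))    # group 1 is over-scored

for g in (0, 1):
    frac_pos, mean_pred = calibration_curve(y[group == g], prob[group == g], n_bins=5)
    ece = np.mean(np.abs(frac_pos - mean_pred))             # crude per-bin calibration error
    print(f"group {g}: mean |observed - predicted| = {ece:.3f}")
```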
- North America > United States (0.14)
- Asia > Singapore > Central Region > Singapore (0.05)
- North America > Canada (0.04)
- Law (1.00)
- Health & Medicine > Therapeutic Area > Cardiology/Vascular Diseases (1.00)
- Health & Medicine > Epidemiology (0.93)
- Health & Medicine > Health Care Providers & Services (0.68)
FairEval: Evaluating Fairness in LLM-Based Recommendations with Personality Awareness
Sah, Chandan Kumar, Lian, Xiaoli, Xu, Tony, Zhang, Li
Recent advances in Large Language Models (LLMs) have enabled their application to recommender systems (RecLLMs), yet concerns remain regarding fairness across demographic and psychological user dimensions. We introduce FairEval, a novel evaluation framework to systematically assess fairness in LLM-based recommendations. FairEval integrates personality traits with eight sensitive demographic attributes, including gender, race, and age, enabling a comprehensive assessment of user-level bias. We evaluate models, including ChatGPT 4o and Gemini 1.5 Flash, on music and movie recommendations. FairEval's fairness metric, PAFS, achieves scores up to 0.9969 for ChatGPT 4o and 0.9997 for Gemini 1.5 Flash, with disparities reaching 34.79 percent. These results highlight the importance of robustness in prompt sensitivity and support more inclusive recommendation systems.
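A generic sketch of attribute-sensitivity probing for an LLM recommender: issue prompts that differ only in a sensitive attribute and score how consistent the returned lists are. This is not the paper's PAFS metric, and `get_recommendations` is a hypothetical stand-in for a RecLLM call.

```python
# Consistency of recommendation lists across sensitive-attribute prompt variants.
# NOT the PAFS metric from the paper; get_recommendations() is a hypothetical stub.
from itertools import combinations

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0

def attribute_consistency(get_recommendations, template, values, k=10):
    """Mean pairwise overlap of top-k lists across sensitive-attribute variants."""
    lists = [get_recommendations(template.format(attr=v))[:k] for v in values]
    pairs = list(combinations(lists, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Usage with a stubbed recommender that ignores the prompt (perfect consistency -> 1.0).
stub = lambda prompt: ["song_%d" % i for i in range(10)]
score = attribute_consistency(stub, "Recommend songs for a {attr} listener.", ["male", "female", "non-binary"])
print(score)
```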
- North America > Canada > Quebec > Montreal (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- Europe > United Kingdom > England > Bristol (0.04)
- (4 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
Fairness Evaluation with Item Response Theory
Xu, Ziqi, Kandanaarachchi, Sevvandi, Ong, Cheng Soon, Ntoutsi, Eirini
Item Response Theory (IRT) has been widely used in educational psychometrics to assess student ability, as well as the difficulty and discrimination of test questions. In this context, discrimination specifically refers to how effectively a question distinguishes between students of different ability levels, and it does not carry any connotation related to fairness. In recent years, IRT has been successfully used to evaluate the predictive performance of Machine Learning (ML) models, but this paper marks its first application in fairness evaluation. In this paper, we propose a novel Fair-IRT framework to evaluate a set of predictive models on a set of individuals, while simultaneously eliciting specific parameters, namely, the ability to make fair predictions (a feature of predictive models), as well as the discrimination and difficulty of individuals that affect the prediction results. Furthermore, we conduct a series of experiments to comprehensively understand the implications of these parameters for fairness evaluation. Detailed explanations for item characteristic curves (ICCs) are provided for particular individuals. We propose the flatness of ICCs to disentangle the unfairness between individuals and predictive models. The experiments demonstrate the effectiveness of this framework as a fairness evaluation tool. Two real-world case studies illustrate its potential application in evaluating fairness in both classification and regression tasks. Our paper aligns well with the Responsible Web track by proposing a Fair-IRT framework to evaluate fairness in ML models, which directly contributes to the development of a more inclusive, equitable, and trustworthy AI.
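For readers unfamiliar with IRT, the standard two-parameter logistic (2PL) model underlying this framework is shown below. In the fairness reading described above, theta plays the role of a predictive model's ability to make fair predictions, and (a, b) are an individual's discrimination and difficulty; the snippet only evaluates item characteristic curves (ICCs), it is not the paper's estimation procedure.

```python
# Two-parameter logistic (2PL) IRT model; evaluates ICCs only.
import numpy as np

def icc(theta, a, b):
    """P(fair outcome | model ability theta, individual discrimination a, difficulty b)."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)     # a range of model abilities
print(icc(theta, a=1.5, b=0.0))   # steep ICC: this individual separates models well
print(icc(theta, a=0.2, b=0.0))   # flat ICC: the outcome barely depends on the model,
                                  # the "flatness" signal used to attribute unfairness
                                  # to the individual rather than the predictive model
```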
- Oceania > Australia > Victoria > Melbourne (0.04)
- Asia (0.04)
- Oceania > Australia > Australian Capital Territory > Canberra (0.04)
- (3 more...)
- Law (1.00)
- Education > Curriculum > Subject-Specific Education (0.47)
Ethical AI Governance: Methods for Evaluating Trustworthy AI
McCormack, Louise, Bendechache, Malika
Trustworthy Artificial Intelligence (TAI) integrates ethics that align with human values, looking at their influence on AI behaviour and decision-making. Primarily dependent on self-assessment, TAI evaluation aims to ensure ethical standards and safety in AI development and usage. This paper reviews the current TAI evaluation methods in the literature and offers a classification, contributing to understanding self-assessment methods in this field.
- North America > United States (0.14)
- Europe > Portugal > Lisbon > Lisbon (0.04)
- Europe > Italy > Campania > Naples (0.04)
- (3 more...)
- Information Technology > Security & Privacy (1.00)
- Government > Regional Government (0.68)
FairLENS: Assessing Fairness in Law Enforcement Speech Recognition
Wang, Yicheng, Cusick, Mark, Laila, Mohamed, Puech, Kate, Ji, Zhengping, Hu, Xia, Wilson, Michael, Spitzer-Williams, Noah, Wheeler, Bryan, Ibrahim, Yasser
Automatic speech recognition (ASR) techniques have become powerful tools, enhancing efficiency in law enforcement scenarios. To ensure fairness for demographic groups in different acoustic environments, ASR engines must be tested across a variety of speakers in realistic settings. However, describing the fairness discrepancies between models with confidence remains a challenge. Meanwhile, most public ASR datasets are insufficient to perform a satisfying fairness evaluation. To address the limitations, we built FairLENS - a systematic fairness evaluation framework. We propose a novel and adaptable evaluation method to examine the fairness disparity between different models. We also collected a fairness evaluation dataset covering multiple scenarios and demographic dimensions. Leveraging this framework, we conducted fairness assessments on 1 open-source and 11 commercially available state-of-the-art ASR models. Our results reveal that certain models exhibit more biases than others, serving as a fairness guideline for users to make informed choices when selecting ASR models for a given real-world scenario. We further explored model biases towards specific demographic groups and observed that shifts in the acoustic domain can lead to the emergence of new biases.
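A minimal sketch of the underlying measurement for this kind of audit: word error rate (WER) computed per demographic group. It is not FairLENS's statistical evaluation method or dataset; the transcripts and group labels are made-up examples.

```python
# Per-group word error rate (WER) for an ASR fairness audit.
# Requires: pip install jiwer
from collections import defaultdict
import jiwer

samples = [  # (group, reference transcript, ASR hypothesis)
    ("group_A", "officer requesting backup on fifth street", "officer requesting backup on fifth street"),
    ("group_A", "suspect heading north on main", "suspect heading north on main"),
    ("group_B", "officer requesting backup on fifth street", "officer requesting back up on fifth treat"),
    ("group_B", "suspect heading north on main", "suspect heading north of maine"),
]

by_group = defaultdict(lambda: ([], []))
for group, ref, hyp in samples:
    by_group[group][0].append(ref)
    by_group[group][1].append(hyp)

for group, (refs, hyps) in by_group.items():
    print(group, "WER =", round(jiwer.wer(refs, hyps), 3))
```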
- Europe > United Kingdom > England (0.15)
- North America > United States > New York (0.05)
- North America > United States > Texas (0.04)
- (3 more...)
- Research Report > New Finding (0.66)
- Research Report > Experimental Study (0.46)
Fairness Evaluation for Uplift Modeling in the Absence of Ground Truth
Kadioglu, Serdar, Michalsky, Filip
The acceleration in the adoption of AI-based automated decision-making systems poses a challenge for evaluating the fairness of algorithmic decisions, especially in the absence of ground truth. When designing interventions, uplift modeling is used extensively to identify candidates that are likely to benefit from treatment. However, these models are particularly difficult to evaluate for fairness because ground truth on the outcome measure is missing: a candidate cannot be in both treatment and control simultaneously. In this article, we propose a framework that overcomes the missing ground truth problem by generating surrogates to serve as a proxy for counterfactual labels of uplift modeling campaigns. We then leverage the surrogate ground truth to conduct a more comprehensive binary fairness evaluation. We show how to apply the approach in a comprehensive study from a real-world marketing campaign for promotional offers and demonstrate how it enhances fairness evaluation.
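One common way to build such surrogates, sketched below under stated assumptions: fit an outcome model on the control group, use its predictions as proxy counterfactual labels for treated individuals, then run an ordinary group-wise check. This illustrates the concept rather than the authors' exact surrogate construction.

```python
# Surrogate counterfactual labels for auditing uplift targeting (concept sketch).
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(3)
n = 6000
df = pd.DataFrame({
    "x1": rng.normal(size=n),
    "x2": rng.normal(size=n),
    "sensitive": rng.integers(0, 2, n),
    "treated": rng.integers(0, 2, n),
})
# Observed outcome (only one potential outcome is ever seen per person).
df["y"] = rng.binomial(1, 1 / (1 + np.exp(-(df.x1 + 0.5 * df.treated - 0.3 * df.sensitive))))

control, treated = df[df.treated == 0], df[df.treated == 1]
features = ["x1", "x2", "sensitive"]

# Counterfactual "no-treatment" labels for treated individuals via a control-only model.
outcome_model = GradientBoostingClassifier().fit(control[features], control["y"])
surrogate_y0 = outcome_model.predict(treated[features])

# With surrogates in hand, targeting can be audited like a binary classifier,
# e.g. proxy uplift (observed outcome minus surrogate baseline) per sensitive group.
uplift_proxy = treated["y"].to_numpy() - surrogate_y0
for g in (0, 1):
    mask = treated["sensitive"].to_numpy() == g
    print(f"sensitive={g}: mean proxy uplift = {uplift_proxy[mask].mean():+.3f}")
```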
- North America > United States > New York > New York County > New York City (0.05)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- (9 more...)
- Research Report > Experimental Study (0.70)
- Research Report > New Finding (0.46)
- Marketing (1.00)
- Government > Regional Government > North America Government > United States Government (1.00)
- Information Technology (0.93)